CrowdER: Crowdsourcing Entity Resolution

نویسندگان

  • Jiannan Wang
  • Tim Kraska
  • Michael J. Franklin
  • Jianhua Feng
چکیده

Entity resolution is central to data integration and data cleaning. Algorithmic approaches have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow) way to bring human insight into the process. Previous work has proposed batching verification tasks for presentation to human workers but even with batching, a human-only approach is infeasible for data sets of even moderate size, due to the large numbers of matches to be tested. Instead, we propose a hybrid human-machine approach in which machines are used to do an initial, coarse pass over all the data, and people are used to verify only the most likely matching pairs. We show that for such a hybrid system, generating the minimum number of verification tasks of a given size is NPHard, but we develop a novel two-tiered heuristic approach for creating batched tasks. We describe this method, and present the results of extensive experiments on real data sets using a popular crowdsourcing platform. The experiments show that our hybrid approach achieves both good efficiency and high accuracy compared to machine-only or human-only alternatives.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving Data Management Applications Using Microtask Platforms

Many data management problems are inherently vague and hard for algorithms to process. Take for example entity resolution, also known as record linkage, the process to resolve records for the same entity from heterogeneous sources. Properly resolving such records require not only the syntactic structure of the data, but also contextual semantics that are hard for machines to understand. To prop...

متن کامل

Crowdsourcing Entity Resolution: a Short Overview and Open Issues

Entity resolution (ER) is a process to identify records that stand for the same real-world entity. Although automatic algorithms aiming at solving this problem have been developed for many years, their accuracy remains far from perfect. Crowdsourcing is a technology currently investigated, which leverages the crowd to solicit contributions to complete certain tasks via crowdsourced marketplaces...

متن کامل

Compare Me Maybe: Crowd Entity Resolution Interfaces

We study the problem of enhancing entity resolution (ER) with the help of crowdsourcing. ER is the problem of identifying records that refer to the same real-world entity and can be an extremely difficult process for computer algorithms alone. For example, figuring out which images refer to the same person can be a hard task for computers, but an easy one for humans. An important component of c...

متن کامل

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

متن کامل

C3D+P: A summarization method for interactive entity resolution

Entity resolution is a fundamental task in data integration. Recent studies of this problem, including active learning, crowdsourcing, and pay-as-you-go approaches, have started to involve human users in the loop to carry out interactive entity resolution tasks, namely to invite human users to judge whether two entity descriptions refer to the same real-world entity. This process of judgment re...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • PVLDB

دوره 5  شماره 

صفحات  -

تاریخ انتشار 2012